Enable DSA CP/absorbed/THD paths with TileLang fused ops #3674
HollowMan6 wants to merge 6 commits into NVIDIA:main
Conversation
Pull request overview
This PR extends the experimental DSAttention (DSA) path to support context parallelism (allgather CP) and packed THD masking, adds an “absorbed MLA” integration path, and introduces TileLang-based fused kernels (indexer + sparse MLA) with fallbacks and expanded unit coverage.
Changes:
- Enable DSAttention CP allgather masking and packed THD (varlen) masking, including sparse-KL streaming for indexer loss.
- Integrate the absorbed-MLA tensor rewrite in MultiLatentAttention and route absorbed execution through DSAttention (with optional fused SparseMLA).
- Add TileLang fused kernels/interfaces for the indexer and sparse MLA, plus extensive new unit tests for CP/THD/absorbed parity and fused plumbing.
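The "fused kernels with fallbacks" wiring can be sketched as an import-guarded dispatch. This is illustrative only: `tilelang_kernels`, `tilelang_indexer_fwd`, and `indexer_fwd_ref` are hypothetical names, not the PR's actual symbols, and the reference path here is a plain Python top-k stand-in for the real indexer math.

```python
# Hypothetical sketch of the fused-kernel-with-fallback pattern; the module
# and function names below are illustrative, not the PR's real symbols.
try:
    from tilelang_kernels import tilelang_indexer_fwd  # fused TileLang path
    HAVE_TILELANG = True
except ImportError:
    HAVE_TILELANG = False

def indexer_fwd_ref(scores, k):
    """Reference (unfused) path: indices of the k largest scores, ascending."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def indexer_fwd(scores, k):
    """Dispatch to the fused kernel when available, else the reference path."""
    if HAVE_TILELANG:
        return tilelang_indexer_fwd(scores, k)
    return indexer_fwd_ref(scores, k)
```

The point of the pattern is that unit tests can exercise both branches and assert parity, which matches the "fused plumbing" coverage the test file adds.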
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit_tests/transformer/test_attention_variant_dsa.py | Adds coverage for CP layout helpers, packed THD varlen masking parity, fused indexer loop behavior, streaming sparse-KL, and absorbed parity. |
| megatron/core/transformer/transformer_config.py | Removes the config-time guard that previously disallowed DSA with context parallelism. |
| megatron/core/transformer/multi_latent_attention.py | Adds absorbed MLA tensor rewrite and passes position_ids/up_v_weight into DSA for CP + absorbed execution. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_sparse_mla_fwd.py | New TileLang sparse-MLA forward kernel + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_sparse_mla_bwd.py | New TileLang sparse-MLA backward kernels + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_indexer_fwd.py | New TileLang fused indexer forward kernel + logits “cleaning” kernel. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_indexer_bwd.py | New TileLang fused indexer backward kernel + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/sparse_mla.py | Autograd wrapper around TileLang sparse-MLA forward/backward. |
| megatron/core/transformer/experimental_attention_variant/ops/indexer.py | Autograd wrapper around TileLang indexer forward/backward and a helper for extracting top-k scores. |
| megatron/core/transformer/experimental_attention_variant/dsa.py | Core DSA updates: CP position/masking helpers, varlen bounds, fused top-k + streaming sparse-KL, scratch caching, absorbed sparse attention routing, and updated loss/masking plumbing. |
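The packed THD (varlen) masking that `dsa.py` and the tests exercise can be illustrated with a small stand-in: tokens attend causally only within their own packed sequence, with boundaries given by cumulative sequence lengths (the usual `cu_seqlens` convention of varlen attention APIs). This is a hedged sketch of the idea, not the PR's actual helper.

```python
# Illustrative sketch of packed THD (varlen) causal masking: several
# sequences are packed into one token dimension, and token q may attend
# to token k only if both lie in the same sequence and k <= q.
# `cu_seqlens` follows the common varlen convention: [0, len0, len0+len1, ...].
def thd_causal_mask(cu_seqlens):
    """Return a total x total boolean mask (True = may attend)."""
    total = cu_seqlens[-1]
    # seq_id[t] = index of the packed sequence that token t belongs to
    seq_id = [0] * total
    for s in range(len(cu_seqlens) - 1):
        for t in range(cu_seqlens[s], cu_seqlens[s + 1]):
            seq_id[t] = s
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
```

A parity test in the style of the new unit tests would build this dense mask and compare it against the masked output of the varlen path.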
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0d5681007
This PR upgrades the DSA path end-to-end to support context parallel (allgather CP) with THD packing support, absorbed MLA integration, and fused TileLang kernels with safe fallbacks. Signed-off-by: Hollow Man <hollowman@opensuse.org>
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as "Ready for review". Read more about the new process in submit.md.
What does this PR do?
This PR upgrades the DSA path end-to-end to support context parallel (allgather CP) with THD packing support, absorbed MLA integration, and fused TileLang kernels with safe fallbacks.
Needs to go in together with #3026.
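Why the allgather CP path needs explicit position ids can be sketched briefly: each context-parallel rank holds only a slice of the sequence, so causal masking must be computed against global positions rather than local indices. The helper below is a hedged illustration assuming a simple contiguous-chunk layout; Megatron's actual CP sharding (e.g. load-balanced chunking) may differ.

```python
# Hedged sketch: global position ids for a context-parallel rank under a
# contiguous-chunk allgather layout (an assumption, not Megatron's exact
# sharding). Causal masks are then built from these global positions.
def cp_local_positions(seq_len, cp_size, cp_rank):
    """Global position ids of the tokens held by `cp_rank`."""
    assert seq_len % cp_size == 0, "sequence must divide evenly across ranks"
    chunk = seq_len // cp_size
    return list(range(cp_rank * chunk, (cp_rank + 1) * chunk))
```

With these positions in hand, a rank can mask attention against the allgathered KV using `global_q_pos >= global_k_pos`, independent of where its chunk sits in the sequence.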
Contribution process
flowchart LR
  A[Pre-checks] --> B[PR Tests]
  subgraph Code Review/Approval
    C1[Expert Review] --> C2[Final Review]
  end
  B --> C1
  C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into the `main` branch
Feel free to message or tag @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label
Add the Expert Review label when your PR is ready for review.
(Step 2): Collect the expert reviewers' reviews
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the Final Review label.
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into the `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.